CUDA: muh faster prompt processing for MoE models and small u-batch sizes #728

Conversation
So we are sure it works with per row scales
Very interesting! I did encounter slow prompt processing performance for GLM 4.5 Air on pure CUDA at

I'm having some trouble getting this PR to compile on Windows. It seems to hit the same error while building multiple of the mmq instance files:

@DocShotgun Thanks! There was the
Built successfully now! Here are some preliminary numbers on Windows 11 with an RTX PRO 6000 (96 GB) on GLM-4.5-Air-Q5_K_S. I ran the same tests that you did above. It's a HUGE speedup on Windows lol. ik_llama.cpp PR vs ik_llama.cpp main
I compared it to current mainline llama.cpp as well (with Johannes's original PR merged). ik_llama.cpp PR vs llama.cpp main
No real changes for hybrid inference. Something made PP fall when I went from batch 1024 to 512: previously it was over 100 on qwen-235b and now it's 66, but I don't think it was this commit.
I think something happened in the CUDA graphs PR, but I have not yet taken the time to go through the commits and find it.
I may also be mistaking the qwen 512 performance for the GLM performance. I tried rolling back to 23fe18c with no change. I'll try rolling back to before CUDA graphs later today.
Because @DocShotgun brought up a comparison with mainline
Of course not. For hybrid inference, such small batches may not even be offloaded to the GPU (batch size must be greater than

Yes, it can be very useful to take notes, so one does not live with the impression that something got broken along the way.

Do you think mainline
Funny enough, I keep notes with benchmarks, but the file is getting unwieldy. Got a small 512 bump from
This PR might have caused a perplexity bump on GLM 4.5 Air. I asked about a perplexity bump on the recent main: while I was trying your commit allowing to quantize ffn_gate_imp, the PPL was higher than expected, including on the source model (the following quant in my exposé). I rolled back from main to isolate the problem to PR 728. OS: Windows 11, full CUDA offload on 3 Ampere GPUs.

- Before PR 728: Final estimate: PPL = 4.6508 +/- 0.02854 (original, before the "muh" PR)
- After PR 728: Final estimate: PPL = 4.6772 +/- 0.02874 (after the "muh" PR)

I made a compromise for myself between speed and quality by shelving 12 PRs in a clean rollback. Here's the modified branch: https://github.com/Nexesenex/ik_llama.cpp.nxs/tree/Rolled-back-to-pre-PR728

- Branch rolled back from main to pre-PR 728: Final estimate: PPL = 4.6544 +/- 0.02857
- Same branch with -no-mmad (disabling MMAD helps quality further): Final estimate: PPL = 4.6511 +/- 0.02854
What PPL do we get with mainline

Btw, you don't need to roll back 12 PRs and all that. To arrive at the behavior prior to this PR, all you need to do is change line 2688 of `ggml/src/ggml-cuda.cu` (at 575e2c2) to

`if (false && src1->ne[2] <= 32*src0->ne[2]`
OK, I can confirm the issue exists in mainline. To check, I changed

```cpp
if (ggml_cuda_should_use_mmq(src0->type, cc, ne12)) {
    ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
    return;
}
```

on lines 2285-2288 to

```cpp
if (false && ggml_cuda_should_use_mmq(src0->type, cc, ne12)) {
    ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
    return;
}
```

which disables that code path.

The PPL increase appears to be systematic. Differences of 0.4% are well beyond the typical range caused by numerical round-off. Hence, it is likely that there is an issue in mainline PR 15525, so pinging @JohannesGaessler as the author of 15525.
I suspected that PR first, and saw that line, but when I tried to undo the commit I ran into conflicts, so I concluded that it could be a more complex issue and that changing a single trigger might not cut it, so I rolled back everything to be absolutely certain. I didn't check on mainline, though, I was sleepy. :D My variances are of the same order of magnitude as yours, and they indeed went way beyond the 0.01% or so which is "acceptable". Anyway, showtime for the big guys! ^^
@Nexesenex See #910. With this, using
Thank you for notifying me. I am seeing an issue specifically on my RTX 5090, with no other GPU being affected. To be clear, the hardware you were using is pre-Blackwell, right? (Could be the same bug, but manifesting differently depending on SM count.)
I'm using RTX-4080. @Nexesenex is using 3090s IIRC.
You mean you computed perplexity on a bunch of MoE models, and the result is the same pre- and post-MUL-MAT-ID optimization on all of the many GPUs that you have, except on the 5090?
No, I mean that on my RTX 5090 (which I didn't yet have back then) there is a very clear problem with PPL increasing by 50%, so I'll debug that first. I think it has to do with the SM count not being a power of 2, which would affect all GPUs in question.
I'm not a GPU guru, but the SM count being a power of 2 is not something I would expect. It is 76 on my 4080, and 82 on the 3090 (which is a pretty popular choice among LLM enthusiasts). Anyway, I hope you will find the issue. Thanks for looking into it!
The reason I suspect it is simply that, for my implementations of stream-k decompositions, this has historically been a frequent source of bugs. The testing I did was simply GraniteMoe with master (+50% PPL vs. an RTX 4090 with 128 SMs), with your patch (fixes the PPL), and with a patch that overrides the SM count (from the perspective of ggml) to 32 (also fixes the PPL). The issue has persisted since back when I implemented this feature in the upstream repository.
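For readers following along, here is a minimal host-side sketch (made-up numbers and names, not the actual ggml/ik_llama.cpp kernels) of why a stream-k style split is sensitive to "odd" SM counts such as 76, 82, or 170: the total K-dimension work is divided evenly across persistent CTAs, so whenever the SM count does not divide the work cleanly, some output tiles end up shared by two CTAs and need a fix-up pass, and that boundary bookkeeping is exactly the kind of place such bugs hide.

```cpp
// Illustration only: count how many output tiles get split between two CTAs
// for a given SM count. All numbers are hypothetical.
#include <cstdio>

int main() {
    const int n_sm           = 82;   // e.g. RTX 3090; 76 on a 4080, 170 on a 5090
    const int n_tiles        = 97;   // output tiles of one matmul
    const int iters_per_tile = 24;   // K-dimension chunks per tile

    const long long total_iters   = 1LL*n_tiles*iters_per_tile;
    const long long iters_per_cta = (total_iters + n_sm - 1)/n_sm;   // ceil split

    int split_tiles = 0;  // tiles whose K-range is shared by two CTAs -> need a fix-up pass
    for (int cta = 0; cta < n_sm; ++cta) {
        const long long first = cta*iters_per_cta;
        if (first >= total_iters) break;                 // trailing CTAs may get no work at all
        if (first % iters_per_tile != 0) split_tiles++;  // this CTA starts mid-tile: partial result
    }
    printf("%lld iters over %d CTAs -> %d tiles need a fix-up pass\n", total_iters, n_sm, split_tiles);
    return 0;
}
```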
`ik_llama.cpp` has excellent CUDA prompt processing (PP) performance for MoE models when using large batch and u-batch sizes (>= 2048). But for smaller batch sizes, PP performance rapidly decreases with decreasing u-batch size. This is not desirable, as larger u-batches result in larger CUDA compute buffers, which reduces the maximum context length one can use and/or the number of MoE layers one can offload to the GPU. This PR remedies the situation by bringing a massive performance improvement for small u-batch sizes.

For this PR I'm standing on the shoulders of giants. Actually, it is a single giant, Johannes Gaessler. The PR is derived from his PR 15525 in mainline `llama.cpp`. But, as is always the case these days, it required quite a bit of adaptation. All additional `ik_llama.cpp` quants needed to be added to the new kernels as well, which of course required getting rid of the almighty assumption in the mainline code that all quantized tensor data consist of blocks of a fixed size arranged contiguously in memory. At this point it is a bit of an append-only programming style, but I'll clean up in subsequent PRs.
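To make the removed assumption concrete, here is a rough sketch (hypothetical type and helper names, not the actual ggml structures) of fixed-block indexing versus a layout that carries per-row metadata such as row scales:

```cpp
// Hypothetical types/helpers, for illustration only - not the actual ggml code.
#include <cstdint>
#include <cstdio>

struct blk_fixed {        // classic fixed-size block: one scale + 32 4-bit quants
    uint16_t d;           // fp16 scale (raw bits here)
    uint8_t  qs[16];
};

// The assumption baked into the original kernels: block b of row r sits at a
// fixed stride, so plain pointer arithmetic on a single block type is enough.
static const blk_fixed * block_fixed(const void * data, int64_t r, int64_t blocks_per_row, int64_t b) {
    return (const blk_fixed *) data + r*blocks_per_row + b;
}

// Quants with per-row scales (or other per-row metadata) break that: a row
// starts with extra bytes, so the kernel must carry an explicit per-row byte
// stride and locate the quant blocks relative to the row start.
static const uint8_t * row_quants(const void * data, int64_t r, int64_t row_stride_bytes, int64_t row_meta_bytes) {
    return (const uint8_t *) data + r*row_stride_bytes + row_meta_bytes;
}

int main() {
    alignas(8) uint8_t buf[4096] = {};
    // Fixed-block view: 8 blocks per row, row 3, block 5.
    const blk_fixed * b = block_fixed(buf, 3, 8, 5);
    // Per-row-scale view: 4 bytes of row metadata followed by 8 blocks of 18 bytes.
    const uint8_t * q = row_quants(buf, 3, 4 + 8*(int64_t)sizeof(blk_fixed), 4);
    printf("fixed-block offset: %td bytes, strided-row offset: %td bytes\n",
           (const uint8_t *) b - buf, q - buf);
    return 0;
}
```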
Here is a performance comparison between the main branch and this PR for DeepSeek-Lite quantized with `Q4_0`, as a function of u-batch size for a fixed batch size of 4096 on Linux. The GPU is an RTX-4080. Flash attention is on, `mla = 3`, `fmoe = 1`.

The PR massively reduces the number of kernel launches during prompt processing. Hence, I wouldn't be surprised if on Windows, where kernel launches are much more expensive, the performance impact is much higher than on Linux.
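As a toy illustration of the launch-overhead point (a standalone CUDA sketch with a made-up kernel and numbers, not ik_llama.cpp code), compare many small per-expert launches against a single launch covering the same work:

```cpp
// Toy launch-overhead comparison. Compile with: nvcc -O2 launch_overhead.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float * x, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n_expert = 64, n = 1 << 16;
    float * d = nullptr;
    cudaMalloc((void **) &d, (size_t) n_expert*n*sizeof(float));
    cudaMemset(d, 0, (size_t) n_expert*n*sizeof(float));

    cudaEvent_t t0, t1; cudaEventCreate(&t0); cudaEventCreate(&t1);
    const int threads = 256, blocks = (n + threads - 1)/threads;

    // (a) one small launch per expert - the pattern the PR moves away from
    cudaEventRecord(t0);
    for (int e = 0; e < n_expert; ++e) touch<<<blocks, threads>>>(d + (size_t) e*n, n);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    float ms_many; cudaEventElapsedTime(&ms_many, t0, t1);

    // (b) a single launch covering all experts at once
    cudaEventRecord(t0);
    touch<<<blocks*n_expert, threads>>>(d, n_expert*n);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    float ms_one; cudaEventElapsedTime(&ms_one, t0, t1);

    printf("%d launches: %.3f ms, 1 launch: %.3f ms\n", n_expert, ms_many, ms_one);
    cudaFree(d);
    return 0;
}
```

The per-launch cost is typically higher on Windows (WDDM) than on Linux, which is consistent with the larger gains reported earlier in this thread.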
Worth noting that the new MoE implementation becomes less efficient than the existing implementation somewhere around a u-batch size of 2000 tokens. Because of that, `ik_llama.cpp` switches to the old MoE implementation for u-batch > 2048. I expect the threshold where the old implementation becomes better to be model dependent (most likely on the ratio between total and active experts), but I haven't studied this in detail, so it is left for future tuning.
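In pseudocode terms, the dispatch described above boils down to something like the following hypothetical helper (not the actual check in ggml-cuda.cu, which compares tensor dimensions as quoted earlier in the thread):

```cpp
// Hypothetical sketch of the dispatch rule described above.
static bool use_new_moe_path(int64_t n_ubatch, int64_t n_expert, int64_t n_expert_used) {
    // Fixed threshold for now; a future tuning pass could make it depend on
    // the ratio of total to active experts, as suggested above.
    (void) n_expert; (void) n_expert_used;
    const int64_t threshold = 2048;
    return n_ubatch <= threshold;
}
```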